Dictionary Methods — Word counts
SPSP 2020
In this module, we will explore dictionary-based word count approaches to text analysis measurement. We will be working with U.S. House floor speeches spanning June 1998 through July 1999. More specifically, we will:
# Load packages
library(pacman)
p_load(readr, dplyr, tidyr, ggplot2, jtools, knitr, reshape2,
       jsonlite, lubridate, sentimentr, magrittr,
       tidytext, stringr, lme4, plotly, sjPlot) # Also used below

'%!in%' <- function(x, y) !(x %in% y) # "Not in" helper used for the train/test split
In the preprocessing module (../notebooks/tdta_preprocessing.Rmd), we saved a tidy format data.frame containing our corpus. We’ll use that as a starting point for our work in this module.
dat_tidy <- readRDS('../data/tdta_clean_house_data_tidy.RDS')
dat_tidy
In this data.frame, words are stored in the cells of the column word. Accordingly, an entire document, identified by the unique document identifier doc_num, is stored across multiple rows. This data.frame also contains metadata like the name of the speaker associated with a given document, as well as the speaker’s party, district, and state.
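If the tidy, one-token-per-row layout is unfamiliar, here is a minimal sketch (using a toy two-document data.frame, not our actual corpus) of how tidytext’s unnest_tokens produces it:

```r
library(dplyr)
library(tidytext)

# Toy corpus: two short "speeches" with document-level metadata
toy <- tibble(doc_num = c(1, 2),
              speaker = c('Speaker A', 'Speaker B'),
              text    = c('I yield my time', 'I thank the gentleman'))

# unnest_tokens() splits text into one word per row, repeating
# the metadata columns on every row of each document
toy_tidy <- toy %>% unnest_tokens(word, text)

# Each document now spans multiple rows, one per word
stopifnot(nrow(toy_tidy) == 8,
          all(c('doc_num', 'speaker', 'word') %in% names(toy_tidy)))
```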
If you have enough data, there’s really never any reason not to separate your data into a training set and a testing set. Of course, determining how much data is enough can be tricky, because you don’t want to create a situation where you are underpowered or unable to estimate a target parameter with sufficient precision.
However, treating documents as our units of observation, we have N = 35,959, which is probably large enough for splitting.
doc_ids <- unique(dat_tidy$doc_num)     # Get document IDs
n_docs <- .50 * length(doc_ids)         # Number of documents to sample
set.seed(5435)                          # Set seed for reproducibility
doc_ids_test <- sample(doc_ids, n_docs) # Sample document IDs for test data
dat_tidy.train <- dat_tidy %>%
filter(doc_num %!in% doc_ids_test) # Select documents for training
dat_tidy.test <- dat_tidy %>%
filter(doc_num %in% doc_ids_test) # Select documents for test
To conduct word count analyses, you need a dictionary or lexicon that specifies the words associated with your target construct(s).
To start, we’ll work with the NRC sentiment dictionary, one of several sentiment lexicons accessible through tidytext:
To access the NRC dictionary, we’ll use the get_sentiments function to store the NRC dictionary in an object called nrc_sent.
nrc_sent <- get_sentiments('nrc')
nrc_sent
nrc_sent contains two columns, word, which contains the words in the dictionary, and sentiment, which specifies the sentiment label associated with the word.
nrc_sent %>%
count(sentiment)
The NRC dictionary contains 10 sentiment categories, and each of these categories has a varying number of words associated with it.
Let’s take a glance at the first words in each category by spreading our tidy data.frame:
nrc_sent %>%
group_by(sentiment) %>%
mutate(temp_id = row_number()) %>% # Create a temporary ID to weight top_n by
top_n(n = -50, wt=temp_id) %>% # Get first 50 items in each group
mutate(temp_id = row_number()) %>% # Create a temporary unique ID for each word in each group
ungroup() %>%
pivot_wider(names_from = sentiment, values_from = word) %>% # Spread our data
select(-temp_id)
Glancing at these words, it’s clear that some words appear in multiple categories.
Question: What other characteristics stand out, if any?
tidytext word count sentiment analysis
In principle, tidytext makes dictionary-based word count sentiment analysis quite simple.
To count the words, we can just inner_join our data with our sentiment dictionary. We’ll also divide the number of matches for each sentiment domain by the total number of words in our corpus. This will tell us the proportion of words associated with each sentiment domain.
total_words = nrow(dat_tidy.train)
dat_tidy.train %>%
inner_join(nrc_sent) %>%
count(sentiment) %>%
mutate(prop = n/total_words) %>%
arrange(desc(prop))
We can also subset our data in order to ask more specific questions. For instance, we can easily estimate sentiment proportions for Democrats and Republicans.
dat_tidy.train.sent <- dat_tidy.train %>%
group_by(Party) %>% # Group by Party
mutate(total_words = n()) %>% # Calculate the total words in each group
ungroup() %>%
inner_join(nrc_sent) # Drop words that aren't in sentiment dictionary
dat_tidy.train.sent %>% count(total_words, Party, sentiment) %>% # Count the number of rows in each Party for each sentiment
mutate(prop = n/total_words) %>% # Calculate the proportion
arrange(desc(prop)) %>% # Arrange in descending order by proportion
select(-n, -total_words) %>%
pivot_wider(names_from='Party', values_from = 'prop')
Overall, it looks like there is very little mean sentiment variation between Republicans and Democrats. However, we’ve collapsed across documents. To get a better idea of how expressions of affective sentiment vary across parties, let’s visualize the distribution of sentiment within documents.
dat_tidy.train.sent %>%
filter(Party !='Independent') %>%
count(total_words, Party, doc_num, sentiment) %>%
mutate(prop = n/total_words) %>%
ggplot(aes(y = prop, x = Party, color=Party)) +
geom_jitter(alpha=.25) +
facet_wrap(.~sentiment, ncol=5) +
scale_colour_manual(values = c("blue", "red")) +
theme_apa() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle('Document-level proportions of sentiment words by party') +
xlab('Party') +
ylab('Proportion')
What if we look at the document Ns of sentiment words instead of proportions?
dat_tidy.train.sent %>%
filter(Party !='Independent') %>%
count(total_words, Party, doc_num, sentiment) %>%
mutate(prop = n/total_words) %>%
ggplot(aes(y = n, x = Party, color=Party)) +
geom_jitter(alpha=.25) +
facet_wrap(.~sentiment, ncol=5) +
scale_colour_manual(values = c("blue", "red")) +
theme() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle('Document-level Ns of sentiment words by party') +
xlab('Party') +
ylab('N')
Question: Why might we want to look at proportion vs. N (or vice versa)?
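One way to see why the choice matters, using purely hypothetical numbers (not from our corpus): raw Ns conflate sentiment with document length, while proportions normalize for it.

```r
# Two hypothetical documents with identical raw counts of negative words
doc_len  <- c(short = 100, long = 1000) # Total words per document
neg_n    <- c(short = 10,  long = 10)   # Negative-word matches (same N!)
neg_prop <- neg_n / doc_len             # Proportions: 0.10 vs 0.01

# Identical Ns, but the short document is proportionally far more negative
stopifnot(neg_n[['short']] == neg_n[['long']],
          neg_prop[['short']] == 0.1,
          neg_prop[['long']] == 0.01)
```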
Clearly, there isn’t much marginal between-group variation in affective sentiment in this dataset. However, maybe there are interesting effects operating at other levels!
Let’s examine temporal variation in affective sentiment between parties. To do this, we will take a similar approach, but instead of counting the sentiment in each document, we’ll count the sentiment on each observed day.
First, however, let’s glance at the distribution of documents across time. For reference, let’s also add vertical lines to indicate the dates on which Clinton’s impeachment was initiated and voted on.
sent_time <- dat_tidy.train.sent %>%
filter(Party !='Independent') %>%
distinct(doc_num, .keep_all = T) %>%
count(date, Party) %>%
ggplot(aes(y = n, x = date, color=Party)) +
geom_line(alpha=.5) +
facet_wrap(Party ~. , ncol=1) +
theme_apa() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
geom_point() +
theme_apa() +
ggtitle('N of documents across time by party') +
ylab('N') +
xlab('Date')
ggplotly(sent_time)
Clearly, there is substantial variation in the number of documents (i.e. speeches given by individual speakers) across time.
Question: What are our sample sizes on the impeachment-relevant days?
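One way to check (a sketch that reuses the dat_tidy.train.sent object built above) is to count distinct documents per party on the two impeachment-relevant dates:

```r
library(dplyr)
library(lubridate)

dat_tidy.train.sent %>%
  distinct(doc_num, .keep_all = TRUE) %>% # One row per document
  filter(date %in% as_datetime(c('1998-10-08', '1998-12-19')),
         Party != 'Independent') %>%
  count(date, Party)                      # N of documents per party per day
```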
Now, let’s plot sentiment across time by party.
dat_tidy.train.sent %>%
filter(Party !='Independent') %>%
count(total_words, Party, date, sentiment) %>%
mutate(prop = n/total_words) %>%
ggplot(aes(y = n, x = date, color=Party)) +
facet_wrap(sentiment~Party, ncol=4) +
scale_colour_manual(values = c("blue", "red")) +
theme() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle('Daily Ns of sentiment words by party') +
xlab('Date') +
ylab('N') +
geom_smooth(color='black') +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2) +
geom_point(alpha=.25)
dat_tidy.train.sent %>%
filter(Party !='Independent') %>%
count(total_words, Party, date, sentiment) %>%
mutate(prop = n/total_words) %>%
ggplot(aes(y = prop, x = date, color=Party)) +
facet_wrap(sentiment~Party, ncol=4) +
scale_colour_manual(values = c("blue", "red")) +
theme() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle('Daily proportions of sentiment words by party') +
xlab('Date') +
ylab('Proportion') +
geom_smooth(color='black') +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2) +
geom_point(alpha=.25)
Question: What can we learn from this figure?
dat_tidy.train.speaker <- dat_tidy.train %>%
group_by(Party, speaker) %>%
mutate(speaker_total_words = n()) %>%
ungroup() %>%
inner_join(nrc_sent) %>%
count(Party, speaker, speaker_total_words, date, sentiment) %>%
group_by(speaker, sentiment) %>%
mutate(speaker_sent_means = mean(n),
speaker_sent_cntr = n - speaker_sent_means)
dat_tidy.train.speaker <- dat_tidy.train %>%
filter(Party != 'Independent') %>%
group_by(Party, speaker, date) %>%
mutate(speaker_day_n = n()) %>%
ungroup() %>%
inner_join(nrc_sent) %>%
count(Party, speaker, date, speaker_day_n, sentiment) %>%
mutate(speaker_day_sent_prop = n/speaker_day_n)
# Create a date range from the min/max dates in our training data
date_grid <- tibble(date = seq(min(dat_tidy.train.speaker$date),
max(dat_tidy.train.speaker$date), by='days')) %>%
mutate(date_int = row_number(), # Associate each date with an integer
date_int_scaled = date_int/100)
dat_tidy.train.speaker <- dat_tidy.train.speaker %>%
left_join(date_grid) %>%
mutate(date_int_scaled = date_int/100,
impeachment_1 = ifelse(date == as_datetime('1998-10-08'), 1, 0),
impeachment_2 = ifelse(date == as_datetime('1998-12-19'), 1, 0))
train.speaker.negative.m1 <- dat_tidy.train.speaker %>%
filter(sentiment=='negative') %>%
lmer(speaker_day_sent_prop ~ 1 + Party*impeachment_1 + Party*impeachment_2 + (1 | speaker) + (1 + Party | date_int), data=.)
summary(train.speaker.negative.m1)
Linear mixed model fit by REML ['lmerMod']
Formula:
speaker_day_sent_prop ~ 1 + Party * impeachment_1 + Party * impeachment_2 +
(1 | speaker) + (1 + Party | date_int)
Data: .
REML criterion at convergence: -34881.5
Scaled residuals:
Min 1Q Median 3Q Max
-2.7629 -0.6398 -0.1575 0.4602 8.0026
Random effects:
Groups Name Variance Std.Dev. Corr
speaker (Intercept) 2.464e-05 0.0049634
date_int (Intercept) 2.916e-05 0.0054003
PartyRepublican 7.967e-07 0.0008926 -0.42
Residual 2.760e-04 0.0166138
Number of obs: 6630, groups: speaker, 638; date_int, 144
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.0307114 0.0006638 46.265
PartyRepublican -0.0001991 0.0006508 -0.306
impeachment_1 0.0039442 0.0058847 0.670
impeachment_2 0.0218401 0.0063649 3.431
PartyRepublican:impeachment_1 0.0004287 0.0033720 0.127
PartyRepublican:impeachment_2 -0.0002016 0.0048421 -0.042
Correlation of Fixed Effects:
(Intr) PrtyRp impc_1 impc_2 PrR:_1
PartyRpblcn -0.522
impechmnt_1 -0.081 0.026
impechmnt_2 -0.079 0.028 0.010
PrtyRpbl:_1 0.045 -0.081 -0.365 -0.007
PrtyRpbl:_2 0.036 -0.064 -0.005 -0.426 0.013
dat_tidy.train.speaker.pred_grid <- expand.grid(Party = unique(dat_tidy.train.speaker$Party),
speaker = 'new_speaker',
date_int = unique(dat_tidy.train.speaker$date_int))
dat_tidy.train.speaker.pred_grid <- dat_tidy.train.speaker.pred_grid %>%
left_join(date_grid) %>%
mutate(impeachment_1 = ifelse(date == as_datetime('1998-10-08'), 1, 0),
impeachment_2 = ifelse(date == as_datetime('1998-12-19'), 1, 0))
train.speaker.negative.m1.pred <- dat_tidy.train.speaker.pred_grid %>%
mutate(preds = predict(train.speaker.negative.m1, newdata=dat_tidy.train.speaker.pred_grid, allow.new.levels=T))
train.speaker.negative.m1.pred %>%
left_join(date_grid) %>%
ggplot(aes(x = date, y = preds, color=Party)) +
geom_line() +
geom_point(aes(y = preds), alpha=.25) +
theme_apa() +
scale_colour_manual(values = c("blue", "red")) +
theme_apa() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
ggtitle('Daily expected proportion of negative language for an average speaker' ) +
ylab('Speaker proportion negative language') +
xlab('Date')
train.speaker.disgust.m1 <- dat_tidy.train.speaker %>%
filter(sentiment=='disgust') %>%
lmer(speaker_day_sent_prop ~ 1 + Party*impeachment_1 + Party*impeachment_2 + (1 | speaker) + (1 + Party | date_int), data=.)
summary(train.speaker.disgust.m1)
Linear mixed model fit by REML ['lmerMod']
Formula:
speaker_day_sent_prop ~ 1 + Party * impeachment_1 + Party * impeachment_2 +
(1 | speaker) + (1 + Party | date_int)
Data: .
REML criterion at convergence: -37281.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.1327 -0.6168 -0.2363 0.3417 11.8113
Random effects:
Groups Name Variance Std.Dev. Corr
speaker (Intercept) 3.769e-06 0.0019413
date_int (Intercept) 4.166e-06 0.0020410
PartyRepublican 1.502e-07 0.0003876 0.15
Residual 5.562e-05 0.0074581
Number of obs: 5428, groups: speaker, 576; date_int, 141
Fixed effects:
Estimate Std. Error t value
(Intercept) 9.484e-03 2.782e-04 34.086
PartyRepublican 3.560e-04 2.948e-04 1.207
impeachment_1 1.546e-03 2.342e-03 0.660
impeachment_2 6.181e-03 2.556e-03 2.419
PartyRepublican:impeachment_1 2.194e-04 1.651e-03 0.133
PartyRepublican:impeachment_2 -3.549e-05 2.214e-03 -0.016
Correlation of Fixed Effects:
(Intr) PrtyRp impc_1 impc_2 PrR:_1
PartyRpblcn -0.516
impechmnt_1 -0.085 0.029
impechmnt_2 -0.085 0.034 0.011
PrtyRpbl:_1 0.043 -0.088 -0.301 -0.008
PrtyRpbl:_2 0.041 -0.078 -0.006 -0.391 0.014
train.speaker.disgust.m1.pred <- dat_tidy.train.speaker.pred_grid %>%
mutate(preds = predict(train.speaker.disgust.m1, newdata=dat_tidy.train.speaker.pred_grid, allow.new.levels=T))
train.speaker.disgust.m1.pred %>%
left_join(date_grid) %>%
ggplot(aes(x = date, y = preds, color=Party)) +
geom_line() +
geom_point(aes(y = preds), alpha=.25) +
theme_apa() +
scale_colour_manual(values = c("blue", "red")) +
theme_apa() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
ggtitle('Daily expected proportion of disgust language for an average speaker' ) +
ylab('Speaker proportion disgust language') +
xlab('Date')
train.speaker.all.m1 <- dat_tidy.train.speaker %>%
lmer(speaker_day_sent_prop ~ 1 + Party*impeachment_1 + Party*impeachment_2 + (1 | speaker) + (1 + Party | date_int) + (1 + impeachment_1 + impeachment_2 | sentiment), data=.)
Model failed to converge with max|grad| = 0.607011 (tol = 0.002, component 1)
summary(train.speaker.all.m1)
Linear mixed model fit by REML ['lmerMod']
Formula:
speaker_day_sent_prop ~ 1 + Party * impeachment_1 + Party * impeachment_2 +
(1 | speaker) + (1 + Party | date_int) + (1 + impeachment_1 +
impeachment_2 | sentiment)
Data: .
REML criterion at convergence: -337962
Scaled residuals:
Min 1Q Median 3Q Max
-4.1494 -0.5209 -0.1410 0.3412 12.8740
Random effects:
Groups Name Variance Std.Dev. Corr
speaker (Intercept) 2.339e-05 0.004837
date_int (Intercept) 4.772e-06 0.002184
PartyRepublican 4.694e-06 0.002166 -0.29
sentiment (Intercept) 4.302e-04 0.020741
impeachment_1 3.693e-06 0.001922 0.65
impeachment_2 4.950e-05 0.007036 0.42 0.23
Residual 2.762e-04 0.016619
Number of obs: 63417, groups: speaker, 666; date_int, 145; sentiment, 10
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.0277547 0.0065691 4.225
PartyRepublican 0.0004757 0.0004813 0.989
impeachment_1 0.0031754 0.0023994 1.323
impeachment_2 0.0050971 0.0033087 1.540
PartyRepublican:impeachment_1 -0.0017793 0.0024291 -0.733
PartyRepublican:impeachment_2 0.0014619 0.0026687 0.548
Correlation of Fixed Effects:
(Intr) PrtyRp impc_1 impc_2 PrR:_1
PartyRpblcn -0.034
impechmnt_1 0.162 0.017
impechmnt_2 0.281 0.015 0.047
PrtyRpbl:_1 0.001 -0.047 -0.335 -0.004
PrtyRpbl:_2 0.001 -0.046 -0.005 -0.290 0.010
convergence code: 0
Model failed to converge with max|grad| = 0.607011 (tol = 0.002, component 1)
dat_tidy.train.speaker.pred_grid.all_sent <- expand.grid(Party = unique(dat_tidy.train.speaker$Party),
speaker = 'new_speaker',
date_int = unique(dat_tidy.train.speaker$date_int),
sentiment=unique(dat_tidy.train.speaker$sentiment))
dat_tidy.train.speaker.pred_grid.all_sent <- dat_tidy.train.speaker.pred_grid.all_sent %>%
left_join(date_grid) %>%
mutate(impeachment_1 = ifelse(date == as_datetime('1998-10-08'), 1, 0),
impeachment_2 = ifelse(date == as_datetime('1998-12-19'), 1, 0))
train.speaker.all.m1.pred <- dat_tidy.train.speaker.pred_grid.all_sent %>%
mutate(preds = predict(train.speaker.all.m1,
newdata=dat_tidy.train.speaker.pred_grid.all_sent,
allow.new.levels=T))
train.speaker.all.m1.pred %>%
left_join(date_grid) %>%
ggplot(aes(x = date, y = preds, color=Party)) +
geom_line() +
geom_point(aes(y = preds), alpha=.25) +
theme_apa() +
scale_colour_manual(values = c("blue", "red")) +
theme_apa() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
ggtitle('Daily expected proportion of sentiment language for an average speaker') +
ylab('Speaker proportion sentiment language') +
xlab('Date') + facet_wrap(sentiment~., ncol=2)
train.speaker.all.m1.pred %>%
left_join(date_grid) %>%
ggplot(aes(x = date, y = preds, color=Party)) +
geom_line() +
geom_point(aes(y = preds), alpha=.25) +
theme_apa() +
scale_colour_manual(values = c("blue", "red")) +
theme_apa() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
ggtitle('Daily expected proportion of sentiment language for an average speaker') +
ylab('Speaker proportion sentiment language') +
xlab('Date') + facet_wrap(sentiment~., ncol=2, scales='free_y')
sjPlot::plot_model(train.speaker.all.m1, type='re')[3]
dat_tidy.test.speaker <- dat_tidy.test %>%
group_by(Party, speaker) %>%
mutate(speaker_total_words = n()) %>%
ungroup() %>%
inner_join(nrc_sent) %>%
count(Party, speaker, speaker_total_words, date, sentiment) %>%
group_by(speaker, sentiment)
dat_tidy.test.speaker <- dat_tidy.test %>%
filter(Party != 'Independent') %>%
group_by(Party, speaker, date) %>%
mutate(speaker_day_n = n()) %>%
ungroup() %>%
inner_join(nrc_sent) %>%
count(Party, speaker, date, speaker_day_n, sentiment) %>%
mutate(speaker_day_sent_prop = n/speaker_day_n)
# Create a date range from the min/max dates in our testing data
date_grid <- tibble(date = seq(min(dat_tidy.test.speaker$date),
max(dat_tidy.test.speaker$date), by='days')) %>%
mutate(date_int = row_number(), # Associate each date with an integer
date_int_scaled = date_int/100)
dat_tidy.test.speaker <- dat_tidy.test.speaker %>%
left_join(date_grid) %>%
mutate(date_int_scaled = date_int/100,
impeachment_1 = ifelse(date == as_datetime('1998-10-08'), 1, 0),
impeachment_2 = ifelse(date == as_datetime('1998-12-19'), 1, 0))
test.speaker.negative.m1 <- dat_tidy.test.speaker %>%
filter(sentiment=='negative') %>%
lmer(speaker_day_sent_prop ~ 1 + Party*impeachment_1 + Party*impeachment_2 + (1 | speaker) + (1 + Party | date_int), data=.)
boundary (singular) fit: see ?isSingular
summary(test.speaker.negative.m1)
Linear mixed model fit by REML ['lmerMod']
Formula:
speaker_day_sent_prop ~ 1 + Party * impeachment_1 + Party * impeachment_2 +
(1 | speaker) + (1 + Party | date_int)
Data: .
REML criterion at convergence: -34917.8
Scaled residuals:
Min 1Q Median 3Q Max
-2.3963 -0.6268 -0.1663 0.4381 11.4546
Random effects:
Groups Name Variance Std.Dev. Corr
speaker (Intercept) 2.261e-05 0.0047553
date_int (Intercept) 3.451e-05 0.0058743
PartyRepublican 6.270e-07 0.0007919 -1.00
Residual 3.021e-04 0.0173800
Number of obs: 6743, groups: speaker, 652; date_int, 143
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.0313529 0.0006947 45.134
PartyRepublican -0.0002498 0.0006483 -0.385
impeachment_1 0.0090151 0.0063993 1.409
impeachment_2 0.0173639 0.0070045 2.479
PartyRepublican:impeachment_1 -0.0083012 0.0033936 -2.446
PartyRepublican:impeachment_2 0.0058701 0.0053485 1.098
Correlation of Fixed Effects:
(Intr) PrtyRp impc_1 impc_2 PrR:_1
PartyRpblcn -0.546
impechmnt_1 -0.083 0.032
impechmnt_2 -0.078 0.031 0.009
PrtyRpbl:_1 0.056 -0.090 -0.500 -0.008
PrtyRpbl:_2 0.038 -0.062 -0.005 -0.506 0.014
convergence code: 0
boundary (singular) fit: see ?isSingular
dat_tidy.test.speaker.pred_grid <- expand.grid(Party = unique(dat_tidy.test.speaker$Party),
speaker = 'new_speaker',
date_int = unique(dat_tidy.test.speaker$date_int))
dat_tidy.test.speaker.pred_grid <- dat_tidy.test.speaker.pred_grid %>%
left_join(date_grid) %>%
mutate(impeachment_1 = ifelse(date == as_datetime('1998-10-08'), 1, 0),
impeachment_2 = ifelse(date == as_datetime('1998-12-19'), 1, 0))
test.speaker.negative.m1.pred <- dat_tidy.test.speaker.pred_grid %>%
mutate(preds = predict(test.speaker.negative.m1, newdata=dat_tidy.test.speaker.pred_grid, allow.new.levels=T))
test.speaker.negative.m1.pred %>%
left_join(date_grid) %>%
ggplot(aes(x = date, y = preds, color=Party)) +
geom_line() +
geom_point(aes(y = preds), alpha=.25) +
theme_apa() +
scale_colour_manual(values = c("blue", "red")) +
theme_apa() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
ggtitle('Daily expected proportion of negative language for an average speaker' ) +
ylab('Speaker proportion negative language') +
xlab('Date')
test.speaker.disgust.m1 <- dat_tidy.test.speaker %>%
filter(sentiment=='disgust') %>%
lmer(speaker_day_sent_prop ~ 1 + Party*impeachment_1 + Party*impeachment_2 + (1 | speaker) + (1 + Party | date_int), data=.)
boundary (singular) fit: see ?isSingular
summary(test.speaker.disgust.m1)
Linear mixed model fit by REML ['lmerMod']
Formula:
speaker_day_sent_prop ~ 1 + Party * impeachment_1 + Party * impeachment_2 +
(1 | speaker) + (1 + Party | date_int)
Data: .
REML criterion at convergence: -37469.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.7009 -0.5983 -0.2280 0.3244 11.5548
Random effects:
Groups Name Variance Std.Dev. Corr
speaker (Intercept) 5.295e-06 0.002301
date_int (Intercept) 4.316e-06 0.002078
PartyRepublican 1.797e-08 0.000134 -1.00
Residual 5.818e-05 0.007628
Number of obs: 5497, groups: speaker, 570; date_int, 143
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.0100217 0.0002945 34.027
PartyRepublican -0.0002112 0.0003215 -0.657
impeachment_1 0.0029204 0.0024163 1.209
impeachment_2 0.0050296 0.0027192 1.850
PartyRepublican:impeachment_1 -0.0014567 0.0016049 -0.908
PartyRepublican:impeachment_2 0.0026842 0.0023911 1.123
Correlation of Fixed Effects:
(Intr) PrtyRp impc_1 impc_2 PrR:_1
PartyRpblcn -0.566
impechmnt_1 -0.082 0.032
impechmnt_2 -0.074 0.030 0.010
PrtyRpbl:_1 0.053 -0.089 -0.456 -0.008
PrtyRpbl:_2 0.037 -0.065 -0.006 -0.511 0.014
convergence code: 0
boundary (singular) fit: see ?isSingular
test.speaker.disgust.m1.pred <- dat_tidy.test.speaker.pred_grid %>%
mutate(preds = predict(test.speaker.disgust.m1, newdata=dat_tidy.test.speaker.pred_grid, allow.new.levels=T))
test.speaker.disgust.m1.pred %>%
left_join(date_grid) %>%
ggplot(aes(x = date, y = preds, color=Party)) +
geom_line() +
geom_point(aes(y = preds), alpha=.25) +
theme_apa() +
scale_colour_manual(values = c("blue", "red")) +
theme_apa() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
ggtitle('Daily expected proportion of disgust language for an average speaker' ) +
ylab('Speaker proportion disgust language') +
xlab('Date')
test.speaker.all.m1 <- dat_tidy.test.speaker %>%
lmer(speaker_day_sent_prop ~ 1 + Party*impeachment_1 + Party*impeachment_2 + (1 | speaker) + (1 + Party | date_int) + (1 + impeachment_1 + impeachment_2 | sentiment), data=.)
Model failed to converge with max|grad| = 0.0117501 (tol = 0.002, component 1)
summary(test.speaker.all.m1)
Linear mixed model fit by REML ['lmerMod']
Formula:
speaker_day_sent_prop ~ 1 + Party * impeachment_1 + Party * impeachment_2 +
(1 | speaker) + (1 + Party | date_int) + (1 + impeachment_1 +
impeachment_2 | sentiment)
Data: .
REML criterion at convergence: -343711.1
Scaled residuals:
Min 1Q Median 3Q Max
-4.3408 -0.5221 -0.1376 0.3464 13.1826
Random effects:
Groups Name Variance Std.Dev. Corr
speaker (Intercept) 3.767e-05 0.006137
date_int (Intercept) 6.120e-06 0.002474
PartyRepublican 4.937e-06 0.002222 -0.60
sentiment (Intercept) 3.716e-04 0.019277
impeachment_1 1.127e-05 0.003357 0.84
impeachment_2 3.965e-05 0.006297 0.26 0.33
Residual 2.753e-04 0.016591
Number of obs: 64495, groups: speaker, 683; date_int, 144; sentiment, 10
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.0287953 0.0061115 4.712
PartyRepublican -0.0002893 0.0005684 -0.509
impeachment_1 0.0025763 0.0028124 0.916
impeachment_2 0.0037042 0.0033973 1.090
PartyRepublican:impeachment_1 -0.0020211 0.0024625 -0.821
PartyRepublican:impeachment_2 0.0004820 0.0027752 0.174
Correlation of Fixed Effects:
(Intr) PrtyRp impc_1 impc_2 PrR:_1
PartyRpblcn -0.049
impechmnt_1 0.313 0.022
impechmnt_2 0.147 0.020 0.080
PrtyRpbl:_1 0.002 -0.041 -0.570 -0.006
PrtyRpbl:_2 0.002 -0.040 -0.006 -0.503 0.010
convergence code: 0
Model failed to converge with max|grad| = 0.0117501 (tol = 0.002, component 1)
dat_tidy.test.speaker.pred_grid.all_sent <- expand.grid(Party = unique(dat_tidy.test.speaker$Party),
speaker = 'new_speaker',
date_int = unique(dat_tidy.test.speaker$date_int),
sentiment=unique(dat_tidy.test.speaker$sentiment))
dat_tidy.test.speaker.pred_grid.all_sent <- dat_tidy.test.speaker.pred_grid.all_sent %>%
left_join(date_grid) %>%
mutate(impeachment_1 = ifelse(date == as_datetime('1998-10-08'), 1, 0),
impeachment_2 = ifelse(date == as_datetime('1998-12-19'), 1, 0))
test.speaker.all.m1.pred <- dat_tidy.test.speaker.pred_grid.all_sent %>%
mutate(preds = predict(test.speaker.all.m1,
newdata=dat_tidy.test.speaker.pred_grid.all_sent,
allow.new.levels=T))
test.speaker.all.m1.pred %>%
left_join(date_grid) %>%
ggplot(aes(x = date, y = preds, color=Party)) +
geom_line() +
geom_point(aes(y = preds), alpha=.25) +
theme_apa() +
scale_colour_manual(values = c("blue", "red")) +
theme_apa() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
ggtitle('Daily expected proportion of sentiment language for an average speaker') +
ylab('Speaker proportion sentiment language') +
xlab('Date') + facet_wrap(sentiment~., ncol=2)
test.speaker.all.m1.pred %>%
left_join(date_grid) %>%
ggplot(aes(x = date, y = preds, color=Party)) +
geom_line() +
geom_point(aes(y = preds), alpha=.25) +
theme_apa() +
scale_colour_manual(values = c("blue", "red")) +
theme_apa() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
ggtitle('Daily expected proportion of sentiment language for an average speaker') +
ylab('Speaker proportion sentiment language') +
xlab('Date') + facet_wrap(sentiment~., ncol=2, scales='free_y')
sjPlot::plot_model(test.speaker.all.m1, type='re')[3]
No matter what you’re trying to measure or what measurement methods you’re using, you should always closely examine what you are actually measuring.
Coincidentally, there’s a very relevant quote from Bill Clinton:
From the [Wikipedia entry](https://en.wikipedia.org/wiki/Impeachment_of_Bill_Clinton) for Clinton’s impeachment:
A much-quoted statement from Clinton’s grand jury testimony showed him questioning the precise use of the word “is”. Contending his statement that “there’s nothing going on between us” had been truthful because he had no ongoing relationship with Lewinsky at the time he was questioned, Clinton said: “It depends upon what the meaning of the word ‘is’ is…”
Question: So what are we measuring when we count words? What are we measuring with the NRC lexicon? Are we really measuring what we think we are?
Let’s take a look!
text_dat <- readRDS('../data/tdta_clean_house_data.RDS')
top_docs <- dat_tidy.train %>%
group_by(doc_num) %>%
mutate(doc_total_words = n()) %>%
ungroup() %>%
inner_join(nrc_sent) %>%
count(doc_num, doc_total_words, sentiment) %>%
mutate(sent_prop = n/doc_total_words) %>%
group_by(sentiment) %>%
top_n(3, wt=n) %>%
left_join(text_dat %>% select(doc_num, text))
top_docs %>%
filter(sentiment=='negative') %>%
arrange(desc(sent_prop)) %>%
mutate(text_seg = str_sub(text, 1,2000)) %>%
select(-text) %>%
View()
Because the documents are so long, it’s actually quite hard to evaluate the validity of our measurement. Another option is to look at the most frequent sentiment words in our corpus.
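For instance, we can tabulate the most frequent dictionary matches per sentiment category (a sketch reusing the dat_tidy.train and nrc_sent objects from above):

```r
library(dplyr)

dat_tidy.train %>%
  inner_join(nrc_sent, by = 'word') %>%   # Keep only dictionary matches
  count(sentiment, word, sort = TRUE) %>% # Count matches per sentiment-word pair
  group_by(sentiment) %>%
  top_n(10, wt = n) %>%                   # Top 10 words per sentiment category
  ungroup() %>%
  arrange(sentiment, desc(n))
```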
Question: What can we learn from looking at these words? How do you feel about our analyses?
One of the greatest strengths of dictionary-based text measurement methods is that they allow you to precisely define the construct you are interested in. This works extremely well when you are interested in specific words or types of words.
For example, if you are interested in function words, then it would never make sense to use anything other than a dictionary-based approach. Similarly, if you are truly interested in the usage of positive or negative words, then, again, it probably wouldn’t make sense to use anything other than a dictionary approach.
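As a sketch of the function-word case (the word list below is a tiny illustrative subset, not a validated function-word dictionary), the dictionary approach is just a join and a count:

```r
library(dplyr)

# A tiny, illustrative function-word "dictionary" (hypothetical subset)
function_words <- tibble(word = c('i', 'we', 'you', 'the', 'a', 'an', 'of', 'and'))

# Proportion of function words per document (assumes dat_tidy.train from above)
dat_tidy.train %>%
  group_by(doc_num) %>%
  mutate(doc_total_words = n()) %>% # Total words per document
  ungroup() %>%
  inner_join(function_words, by = 'word') %>%
  count(doc_num, doc_total_words) %>%
  mutate(function_word_prop = n / doc_total_words)
```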
In these examples, there is a 1:1 relationship between the target construct and the operationalization. However, this 1:1 relationship is difficult to maintain for more abstract constructs, like “positive sentiment” or “negative sentiment”. In such cases, you (or someone else) have to decide which words evoke “positive sentiment” or “negative sentiment”.
Further, we are often interested in expressions of meaning that may operate above the word level. For instance, consider the following example:
“Let’s just say… I didn’t love it”
Most dictionary-based word count methods would estimate the sentiment expressed in this sentence as “positive” because of the token love. However, considering the entire context of this example, we can infer that the most likely sentiment is probably “negative”. Another issue related to context sensitivity is domain dependence: a word might have negative connotations in some discourse communities, but not in others.
In sum, dictionary-based word count approaches can be quite powerful; however, they have two notable shortcomings: the mapping between a dictionary’s words and an abstract target construct rests on someone’s judgment, and word counts are insensitive to context above the word level (e.g., negation and domain-dependent word meanings).
This does not mean that you shouldn’t use dictionary-based word count methods. However, it does mean that you should keep these shortcomings in mind. And, even better, you should try to account for them.
In response to some of the issues raised above, people have started trying to improve on word count methods, for instance by accounting for negation or assigning weights to sentiment words. In R, you can use the sentimentr package to do these things.
We’re not going to go into detail, but sentimentr operates at the sentence level, provides the option of assigning continuous weights to words, and attempts to account for negation by looking for patterns in user-specified windows around sentiment words. Its built-in lexicon is a combination of multiple lexicons (so it might share many of the same issues we observed in the NRC), but at least it tries to handle negation out of the box.
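Before applying it to the corpus, here is a minimal illustration of the negation handling on our earlier example (exact scores depend on sentimentr’s lexicon and weights, so none are asserted here):

```r
library(sentimentr)

# sentimentr scores at the sentence level and checks for valence shifters
# (negators like "didn't") in a window around polarized words like "love"
sentiment(get_sentences("I didn't love it."))

# Compare with the un-negated version; the negated sentence should
# receive a lower (likely negative) sentiment score
sentiment(get_sentences("I loved it."))
```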
text_dat.sentr <- text_dat %>%
filter(doc_num %!in% doc_ids_test) %>% # Select documents for training
# Collapse period-bearing abbreviations (Mr., H.R., Num.) to a plain token
# so their periods aren't mistaken for sentence boundaries
mutate(text = gsub('[Mm]r\\.|[Hh]\\.[Rr]\\.|[Nn][Uu][Mm]\\.', 'mr', text)) %>%
mutate(sentences = get_sentences(text)) %$%
sentiment_by(sentences, by = list(doc_num)) # Average sentiment per document
saveRDS(text_dat.sentr, file = '../data/text_dat_train_sentimentr_scores.RDS')
# For some reason looking at text_dat.sentr crashes my notebook, so
# let's look at it in the console
Let’s look at sentiment estimated with sentimentr at the day level.
dat_tidy.train %>%
filter(Party != 'Independent') %>%
distinct(doc_num, .keep_all = T) %>%
left_join(text_dat.sentr) %>%
group_by(Party, date) %>%
summarize(mean_sent = mean(ave_sentiment)) %>%
ggplot(aes(x = date, y = mean_sent, color=Party)) +
geom_line() +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
geom_point() +
theme_apa() +
ggtitle('Mean sentiment across time by party') +
ylab('Mean sentiment') +
xlab('Date') +
facet_wrap(Party~., ncol=1) +
geom_hline(yintercept=0)
Joining, by = "doc_num"
Question: What does this figure suggest?
Let’s run one of our models with this data and see what it tells us…
text_dat.sentr.train <- dat_tidy.train %>%
filter(Party != 'Independent') %>%
distinct(doc_num, .keep_all = T) %>%
left_join(text_dat.sentr) %>%
group_by(Party, speaker, date) %>%
summarize(mean_sent = mean(ave_sentiment)) %>%
left_join(date_grid) %>%
mutate(date_int_scaled = date_int/100,
impeachment_1 = ifelse(date == as_datetime('1998-10-08'), 1, 0),
impeachment_2 = ifelse(date == as_datetime('1998-12-19'), 1, 0))
Joining, by = "doc_num"
Joining, by = "date"
p_load(lme4) # lmer() lives in lme4
text_dat.sentr.train.m1 <- lmer(mean_sent ~ 1 + Party * impeachment_1 + Party * impeachment_2 +
(1 | speaker) + (1 + Party | date_int),
data = text_dat.sentr.train)
summary(text_dat.sentr.train.m1)
Linear mixed model fit by REML ['lmerMod']
Formula: mean_sent ~ 1 + Party * impeachment_1 + Party * impeachment_2 +
(1 | speaker) + (1 + Party | date_int)
Data: text_dat.sentr.train
REML criterion at convergence: -7947.2
Scaled residuals:
Min 1Q Median 3Q Max
-4.4500 -0.6075 0.0001 0.5915 6.2702
Random effects:
Groups Name Variance Std.Dev. Corr
speaker (Intercept) 1.627e-03 0.040334
date_int (Intercept) 1.416e-03 0.037629
PartyRepublican 8.741e-05 0.009349 -0.84
Residual 1.776e-02 0.133258
Number of obs: 7171, groups: speaker, 667; date_int, 145
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.135981 0.004920 27.636
PartyRepublican -0.001476 0.005123 -0.288
impeachment_1 -0.035166 0.041857 -0.840
impeachment_2 -0.134203 0.046284 -2.900
PartyRepublican:impeachment_1 0.030942 0.027054 1.144
PartyRepublican:impeachment_2 0.008663 0.039290 0.220
Correlation of Fixed Effects:
(Intr) PrtyRp impc_1 impc_2 PrR:_1
PartyRpblcn -0.604
impechmnt_1 -0.080 0.035
impechmnt_2 -0.077 0.036 0.010
PrtyRpbl:_1 0.056 -0.078 -0.549 -0.009
PrtyRpbl:_2 0.044 -0.061 -0.007 -0.557 0.013
Question: How does this compare to our previous results?
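The chunk that builds `text_dat.sentr.train.m1.pred` isn’t shown above. Here is a minimal sketch of one way to construct it; the object and column names match the plotting code below, but the prediction grid and the choice to average over speakers are assumptions:

```r
# Predict from the fitted model for the observed Party x date combinations,
# dropping the speaker random effect (re.form keeps only the date terms) so
# predictions reflect an "average" speaker
text_dat.sentr.train.m1.pred <- text_dat.sentr.train %>%
  ungroup() %>%
  distinct(Party, date, date_int, date_int_scaled, impeachment_1, impeachment_2) %>%
  mutate(preds = predict(text_dat.sentr.train.m1,
                         newdata = .,
                         re.form = ~ (1 + Party | date_int)))
```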
text_dat.sentr.train.m1.pred %>%
left_join(date_grid) %>%
ggplot(aes(x = date, y = preds, color=Party)) +
geom_line() +
geom_point(aes(y = preds), alpha=.25) +
theme_apa() +
scale_colour_manual(values = c("blue", "red")) +
geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2, alpha=.25) +
geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2, alpha=.25) +
ggtitle('Daily expected sentiment for an average speaker') +
ylab('Expected sentiment') +
xlab('Date')
Joining, by = c("date_int", "date", "date_int_scaled")
Question: How does this compare to our previous results?
It’s also possible to work with non-tidy dictionaries in R. For instance, we can use the package quanteda to load a LIWC format dictionary and get word counts. While there are a few ways to do this, we’ll load the dictionary into a quanteda object and then convert our training corpus into a quanteda corpus object, which is just a native quanteda format. We’ll then use quanteda to create a so-called document feature matrix or dfm, using our corpus and LIWC format dictionary.
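A minimal sketch of that workflow, assuming a LIWC-format dictionary file at a hypothetical path (`../dictionaries/LIWC2015.dic`) and the raw-text `text_dat` data.frame used earlier:

```r
p_load(quanteda)

# Load a LIWC-format dictionary into a quanteda dictionary object
liwc_dict <- dictionary(file = '../dictionaries/LIWC2015.dic', format = 'LIWC')

# Convert the training documents into a quanteda corpus object
train_corpus <- text_dat %>%
  filter(doc_num %!in% doc_ids_test) %>%
  corpus(docid_field = 'doc_num', text_field = 'text')

# Tokenize, map tokens onto dictionary categories, and tabulate the counts
# in a document feature matrix (dfm)
dfm_liwc <- train_corpus %>%
  tokens() %>%
  tokens_lookup(dictionary = liwc_dict) %>%
  dfm()
```

The resulting dfm has one row per document and one column per dictionary category, with cells holding category word counts.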
as.data.frame() # Fails: as.data.frame() needs an object to convert
Error in as.data.frame() : argument "x" is missing, with no default
To inspect a dfm as a data.frame, pass the object explicitly, e.g. with quanteda’s convert(your_dfm, to = "data.frame").